TEAM MEMBERS
The COVID-19 coronavirus pandemic is an unprecedented health crisis that has impacted the world on a massive scale. According to the WHO, mental disorders are among the leading causes of disability worldwide, and the pandemic has further complicated mental illness. The stress, anxiety, and depression stemming from fear, isolation, and the stigma around COVID-19 have affected all of us in one way or another. Many people have lost their jobs, and the elderly have been isolated from their usual support networks. The concern is that the pandemic's effects on mental health may be longer-lasting than the disease itself.
Although the measures taken to slow the spread of the virus were necessary, they have affected our physical activity levels, eating behaviors, sleep patterns, and our relationship with addictive substances, including social media. On this last point, both our increased use of social media while stuck at home and the increased exposure to disaster news over the past year have amplified the negative effects of social media on our mental health. This motivates us to perform a diagnostic analysis of this pattern and draw meaningful insights from global survey data.
Following this motivation, we split our task into three main objectives: the impact of distress at a global level, Twitter, and infodemics. Starting from a basic questionnaire analysis, we dive into mainstream platforms to examine where people seek information, how they perceive, assimilate, and share it, and how this affects their productivity.
The COVIDiSTRESS global survey is an international collaborative undertaking to gather data on people's experiences, behavior, and attitudes during the COVID-19 pandemic. In particular, the survey focuses on psychological stress, compliance with behavioral guidelines to slow the spread of the coronavirus, and trust in governmental institutions and their preventive measures, but multiple further items and scales are included for descriptive statistics, further analysis, and comparative mapping between participating countries. Data is being collected by committed researchers in 43+ countries, mainly online through social snowballing, viral spread, and help from interested partners including the media. We perform basic descriptive statistics to analyze survey trends and report how participants across different countries have contributed to this task.
A 2018 study conducted by MIT researchers and published in Science discovered that false stories on Twitter diffused faster and more widely than true ones. The study analyzed millions of tweets and concluded that the novelty of false news and the emotional reactions of Twitter users may be contributing factors. With that in mind, pause before clicking the share button. Every social media post should be viewed with skepticism until you have done your own fact-checking: Do facts and figures cite their sources? If so, is the data reliable? Is the data clearly explained, and are its limitations addressed? Beware of false equivalency (comparing apples with pears). A good example is the volume of charts currently being shared that compare countries with vastly different population densities. We plan to put forth a comparative analysis of Twitter during this pandemic for 2020 and 2021.
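The population-density point above can be addressed by normalizing counts per capita before charting. A minimal sketch in R, where the case counts and populations are purely illustrative, not real figures:

```r
# Illustrative only: toy case counts and populations, not real data
cases <- data.frame(
  country    = c("A", "B"),
  confirmed  = c(50000, 50000),  # identical raw counts...
  population = c(5e6, 5e7)       # ...but a 10x population difference
)

# Cases per 100,000 inhabitants put both countries on a comparable scale
cases$per_100k <- cases$confirmed / cases$population * 1e5
cases  # country A carries 10x the per-capita burden of country B
```

Charts built on a per-100k column rather than raw counts avoid the apples-with-pears comparison described above.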
We are the channels that spread the infodemic through our networks. Even when the validity of a message is in doubt, it is often blindly forwarded from one person to another. These messages are designed to mimic inside information or a scoop on what is really happening, and they are implicitly given a fake seal of validity by arriving from a close contact. Governments have been upfront about sharing information as quickly and reliably as possible, so it is imperative that we do not get sidetracked by word-of-mouth messages from loosely connected "official" sources. It takes less than a second to lose our concentration and hit the forward button, giving the infodemic further fuel. Even if you think your network is big and savvy enough to make up its own mind, there may be people more at risk from these types of malicious messages. Just as with the pandemic, less digitally fluent people find it hard to decipher such messages, causing undue stress and harm. This motivates us to build a trust graph based on all popular communication channels to assess their Infodemic Risk Index (IRI) scores.
(The COVIDiSTRESS global survey is an open science collaboration, created by researchers in over 40 countries to rapidly and organically collect data on human experiences of the Coronavirus epidemic 2020.) Dataset can be downloaded here: (Andreas Lieberoth 2020) https://osf.io/z39us/
We aim to work on the most recent dataset aggregated from Twitter using twitteR and rtweet libraries within a particular time and location.
Here twitteR, which provides an interface to the Twitter web API, and rtweet, which acts as a client for Twitter's REST and stream APIs, will be used to retrieve data.
Data cleaning techniques are a heavy part of this step: removing stop words, emojis, and cryptic characters, and converting text to lowercase to maintain semantic integrity.
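A minimal sketch of this retrieval-and-cleaning step, assuming rtweet API credentials are already configured; the query string and the `search_tweets()` call are illustrative, and `clean_tweets()` is a simplified stand-in for the full pipeline:

```r
# library(rtweet)                     # client for Twitter's REST/stream APIs
# tweets <- search_tweets("covid19",  # hypothetical query; needs API access
#                         n = 1000, include_rts = FALSE)

# Basic normalization: lowercase, strip URLs, drop emojis / cryptic
# characters, and remove a small stop-word list
clean_tweets <- function(text,
                         stopwords = c("the", "a", "an", "is", "of")) {
  text <- tolower(text)
  text <- gsub("https?://\\S+", "", text)   # remove URLs
  text <- gsub("[^a-z@#' ]", " ", text)     # drop emojis / non-ASCII noise
  words <- strsplit(text, "\\s+")
  vapply(words, function(w) {
    paste(w[nzchar(w) & !w %in% stopwords], collapse = " ")
  }, character(1))
}

clean_tweets("The VACCINE is here!! https://t.co/xyz")  # → "vaccine here"
```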
(The Role of Trust and Information during the COVID-19 Pandemic and Infodemic) Dataset can be downloaded here: [R. Gallotti, N. Castaldo, F. Valle, P. Sacco and M. De Domenico, COVID19 Infodemics Observatory (2020). DOI: 10.17605/OSF.IO/N6UPX] [Van Mulukom, V. (2021, May 15). The Role of Trust and Information during the COVID-19 Pandemic and Infodemic. https://doi.org/10.17605/OSF.IO/GFWBQ] https://osf.io/n6upx/, https://osf.io/67zhg/, https://osf.io/rtacb/, https://osf.io/dh879/, https://osf.io/c37wq/
These datasets comprise a summary of infodemics data collected across countries: the world risk index, population emotional state, and news reliability.
( All the above listed datasets can be accessed via our Github repository linked at the footer of this notebook. )
First we analyze how the IRI took form across the world during the first quarter of 2020.
RESOURCES_DIR_PATH <- getwd()
INFODEMIC_REDUCED_FILE_PATH <- file.path(RESOURCES_DIR_PATH, "infodemics_reduced.csv")
WORLD_RISK_INDEX_FILE_PATH <- file.path(RESOURCES_DIR_PATH, "world_risk_index.csv")
dat.red <- read.table(INFODEMIC_REDUCED_FILE_PATH, header = T, sep = ";")
dat.red$date <- as.Date(dat.red$date)
dat.red
This dataset has the iso3 country code and the volume of the infodemic spread across continents, which is vital for extracting specific insights into how the IRI has evolved over the timeline.
dat.iri.world <- read.table(WORLD_RISK_INDEX_FILE_PATH, header = T, sep = ";")
dat.iri.world
This dataset has the world risk index for each day in 2020. We extract these data and use them in our analysis, shown on a timeline with a focus on the risk index.
Here we examine how the IRI has evolved over the timeline and how periodically its cycle repeats.
COUNTRY <- "ITA"
#COUNTRY <- c("ITA","VEN")
dat.red.country.tmp <- dat.red[
which(dat.red$iso3 == COUNTRY),
c("date", "IRI_UNVERIFIED", "IRI_VERIFIED")
]
dat.red.country <- data.frame(
date = dat.red.country.tmp$date,
Unverified = (dat.red.country.tmp$IRI_UNVERIFIED),
Verified = (dat.red.country.tmp$IRI_VERIFIED)
)
tmp.cum.mean <- dplyr::cummean(rowSums(dat.red.country[order(dat.red.country$date), 2:3]))
dat.red.country <- melt(dat.red.country, id.vars = "date")
dat.red.country.epi <- data.frame(
date = dat.red[which(dat.red$iso3 == COUNTRY), ]$date,
epi.new = c(0, diff(dat.red[which(dat.red$iso3 == COUNTRY), ]$EPI_CONFIRMED))
)
dat.red.country.epi[which(dat.red.country.epi$epi.new == 0), ]$epi.new <- NA
dat.red.country.cummean <- data.frame(
date = dat.red[which(dat.red$iso3 == COUNTRY), ]$date,
Cum.Mean = tmp.cum.mean
)
ggplot() +
theme_bw() +
theme(panel.grid = element_blank()) +
geom_point(data = dat.red.country.epi,
aes(date, size = epi.new),
y = 0.9, alpha = 0.5,
color = "tomato") +
geom_col(data = dat.red.country,
aes(date, value, fill = variable),
position = position_stack(reverse = TRUE)) +
scale_fill_manual(name = "",
values = c('#4DBBD5FF', '#3C5488FF')) +
ylab("IRI") +
xlab("Timeline") +
ylim(c(0, 1)) +
guides(size = guide_legend(title = "New Cases")) +
geom_text(data = dat.red.country.epi,
aes(x = date, label = epi.new),
angle = 90,
y = 0.97,
size = 2,
color = "grey30") +
geom_line(data = dat.red.country.cummean,
aes(date, Cum.Mean),
linetype = "dashed",
color = "grey30")+
ggtitle("IRI scores in Italy")
It is important to note that in early 2020 there was an extremely large volume of new cases; the outbreak took off in March, increasing by roughly 500 cases per day. It is also salient that a large unverified component of the IRI is noticeable throughout the entire timeline.
It is surprising to see the absence of new cases in Venezuela, yet its cumulative mean is very high across all months, standing at 0.87 on 17-02-2020.
For the first quarter of 2020, from mid-January to mid-March, we analyze these effects across different countries.
library(RColorBrewer)
col <- brewer.pal(9, "YlGnBu")
idxs.sub <- which(dat.red$TWI_VOLUME > 2000 & dat.red$EPI_CONFIRMED > 100)
country.sub <- as.character(unique(dat.red[idxs.sub, ]$iso3))
dat.red.sub <- dat.red[which(dat.red$iso3 %in% country.sub),
c('date' ,'iso3', 'IRI_ALL')]
ggplotly(ggplot(dat.red.sub, aes(x = date,
y = reorder(iso3, IRI_ALL),
fill = IRI_ALL)) +
geom_tile() +
theme_bw() +
scale_fill_gradientn(colors = col, limits = c(0, 1), name = "IRI") +
theme(panel.background = element_blank(),
panel.grid.major = element_blank(), legend.position = 'top') +
ylab('Country') +
xlab('Timeline') +
geom_hline(yintercept = c(seq(1.5, 21, 1)), color = 'grey70') +
scale_x_date(expand = c(0, 0)) +
ggtitle("IRI scores across 20 different countries"))
It can be seen that in countries like Iran the IRI score is very high compared to several European countries, from the second half of January to the first half of March 2020.
We assess and compare the cumulative number of reported cases which are grouped into 6 different bins for all countries.
dat.corr2 <- data.frame()
dat.corr <- dat.red[, c("date", "iso3", "EPI_CONFIRMED", "IRI_ALL")]
for(cc in unique(dat.corr$iso3)){
tmp <- dat.corr[which(dat.corr$iso3 == cc), ]
tmp <- tmp[order(tmp$date), ]
tmp$EPI_CONFIRMED_DAILY <- c(0, diff(tmp$EPI_CONFIRMED))
tmp$IRI_ALL_CUMMEAN <- dplyr::cummean(tmp$IRI_ALL)
dat.corr2 <- rbind(dat.corr2, tmp)
}
dat.corr2 <- dat.corr2[! is.na(dat.corr2$IRI_ALL), ]
dat.corr2 <- dat.corr2[- which(dat.corr2$EPI_CONFIRMED == 0), ]
bin <- rep(0, nrow(dat.corr2))
bin[which(dat.corr2$EPI_CONFIRMED <= 2 )] <- 0
bin[which(3 <= dat.corr2$EPI_CONFIRMED & dat.corr2$EPI_CONFIRMED < 8)] <- 1
bin[which(8 <= dat.corr2$EPI_CONFIRMED & dat.corr2$EPI_CONFIRMED < 16)] <- 2
bin[which(16 <= dat.corr2$EPI_CONFIRMED & dat.corr2$EPI_CONFIRMED < 51)] <- 3
bin[which(51 <= dat.corr2$EPI_CONFIRMED & dat.corr2$EPI_CONFIRMED < 10001)] <- 4
bin[which(10001 <= dat.corr2$EPI_CONFIRMED & dat.corr2$EPI_CONFIRMED < 81000)] <- 5
dat.corr2$bin <- bin
labels.min <- dat.corr2 %>% group_by(bin) %>% summarise_at(vars(EPI_CONFIRMED), min)
labels.max <- dat.corr2 %>% group_by(bin) %>% summarise_at(vars(EPI_CONFIRMED), max)
lab <- paste0(labels.min$EPI_CONFIRMED, '-', labels.max$EPI_CONFIRMED)
lab[5:6] <- c('51-9999', '10000+')
ggplotly(ggplot(dat.corr2, aes(as.factor(bin), IRI_ALL_CUMMEAN)) +
theme_bw() +
theme(panel.grid = element_blank(),
legend.position = "none") +
geom_boxplot(aes(fill = as.numeric(bin)),
size = 0.15,
outlier.color = "grey70",
color = "grey70", notch = TRUE) +
xlab("Cumulative Number of Reported Cases") +
ylab("IRI Cumulative Mean") +
scale_fill_viridis_c() +
scale_x_discrete(labels = lab) +
ggtitle("Cumulative IRI vs. Epidemic per index confirmed"))
Debriefing
The reported-case groups in the 3-7 and 8-15 ranges are reasonably symmetric, indicating less variability in the analysis, while the 1-2, 16-50, 51-9999, and 10k+ groups are left-skewed, signifying some degree of variability.
The medians of all six groups differ, suggesting a significant difference between the groups with respect to the cumulative IRI.
Note also that the 1-2 case group shows seemingly more outliers than the other four case groups, though not the 10k+ group. In addition, for the 51-9999 group, the two outlier points above its upper fence overlap, resembling an outlier cluster.
This plot shows IRI vs. confirmed COVID-19 cases, with infodemics and epidemics data aggregated by country and, at a broader level, categorized by continent.
x0 <- aggregate(TWI_VOLUME ~ iso3, dat.red, mean)
colnames(x0) <- c("Country", "Message.Volume")
x1 <- aggregate( EPI_CONFIRMED ~ iso3, dat.red, max)
colnames(x1) <- c("Country", "Infected")
x2a <- aggregate( IRI_UNVERIFIED ~ iso3, dat.red, mean)
colnames(x2a) <- c("Country", "Risk Unverified")
x2b <- aggregate( IRI_VERIFIED ~ iso3, dat.red, mean)
colnames(x2b) <- c("Country", "Risk Verified")
tab <- merge(x0, x1, by = "Country")
tab <- merge(tab, x2a, by = "Country")
tab <- merge(tab, x2b, by = "Country")
tab$Info.Risk <- tab[, "Risk Verified"] + tab[, "Risk Unverified"]
tab$Continent <- countrycode(tab$Country, 'iso3c', 'continent')
tab <- tab[order(tab$Info.Risk),]
tab <- tab[! is.na(tab$Continent), ]
rownames(tab) <- NULL
Infect.thres <- 0
idxs <- which(tab$Infected > Infect.thres & ! tab$Country %in% c("CHN", "TWN", "IRN"))
ggplot(tab[idxs, ], aes(Info.Risk, Infected, color = Continent, size = Message.Volume)) +
theme_bw() +
theme(panel.grid = element_blank()) +
stat_smooth(method = 'lm',
linetype = "solid",
color = "red",
alpha = 0.2,
size = 0.25,
se = T) +
geom_point(alpha = 0.7) +
scale_color_npg() +
geom_text_repel(aes(label = Country),
show.legend = F,
seed = 786) +
scale_y_log10() +
geom_vline(xintercept = median(tab$Info.Risk[idxs], na.rm = T),
linetype = "dashed",
color = "#dadada") +
geom_hline(yintercept = median(tab$Infected[idxs], na.rm = T),
linetype = "dashed",
color = "#dadada") +
xlab("IRI") +
ylab("Confirmed COVID19 Cases") +
stat_smooth(linetype = "dashed",
color = "black",
alpha = 0.2,
size = 0.35,
se = F ) +
labs(size = 'Volume')+
ggtitle("Showing confirmed COVID-19 cases across countries and the IRI score")
Debriefing
It can be inferred that both the IRI and the message volume are very high in the USA, with approximately 383,210 cases. We have also focused on showing this critical impact for the top five continents around the globe. Moreover, note that the IRI score in Peru is very high (almost 0.98) even though the total number of infected cases there is only 11.
We show the trust levels for the most common media sources from which people sought information.
data_12 <- read.spss("DATA_COVID19_TrustInformation.sav",
use.value.labels = FALSE,
to.data.frame = TRUE)
names(data_12) <- tolower(names(data_12))
data_12